You work as a Data Scientist at a P-2-P platform that was founded a year ago. Now you want to expand your business. Your team has split up and each analyst has a subset of the data. Your job is to find out insights for your platform in an exploratory data analysis.
Your business model is to run a platform (crowd-investing) where people who have a business idea, but not the money needed, can sign up and raise money for their project within a given time. On the other hand, you have funders who would like to invest their money in projects and are looking for investments. As an intermediary, your platform brings borrowers and lenders together. You earn your money with a commission for each project that lands on your platform.
Your database is the history of your platform. All projects are completed projects, i.e. the time to raise money for your project has expired. Your business model is to pay out the collected money even if the target amount is not reached.
The original data set has 671205 data points and 13 features, thus covering a memory of 450.775664 mb.
The split record contains the following columns (incl. meaning):
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
data_1=pd.read_csv('input/your_part1.csv',
sep='/n',
engine='python',
nrows=2)
data_1
| , funded_amount, loan_amount, activity, sector, use, country_code, country, region, currency, term_in_months, lender_count, borrower_genders, repayment_interval | |
|---|---|
| 0 | 0,300.0,300.0,Fruits & Vegetables,Food,"To buy... |
| 1 | 1,575.0,575.0,Rickshaw,Transportation,to repai... |
data_1=pd.read_csv('input/your_part1.csv',
sep=',',
index_col=0)
data_1.tail(5)
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 334995 | 7125.0 | 7125.0 | Plastics Sales | Retail | to stock up with a dozen plastic tubs and buck... | CD | The Democratic Republic of the Congo | Goma, North Kivu province | USD | 6.0 | 15 | female, female, female, female, female, female... | irregular |
| 334996 | 125.0 | 125.0 | Home Appliances | Personal Use | to buy a water filter to provide safe drinking... | KH | Cambodia | Phnom Penh | KHR | 8.0 | 5 | male, male, male | monthly |
| 334997 | 925.0 | 925.0 | Cattle | Agriculture | to increase his herd to get more milk and prov... | TJ | Tajikistan | Khuroson | TJS | 14.0 | 28 | male | monthly |
| 334998 | 175.0 | 175.0 | General Store | Retail | to buy additional items to sell, like beverage... | PH | Philippines | Palo, Leyte | PHP | 8.0 | 6 | female | irregular |
| 334999 | 150.0 | 150.0 | Food Production/Sales | Food | to buy sticky rice and sugar. | PH | Philippines | Jagna, Bohol | PHP | 8.0 | 5 | female | irregular |
data_2=pd.read_csv('input/your_part2.csv',
sep='/n',
engine='python',
nrows=2)
data_2
| # funded_amount# loan_amount# activity# sector# use# country_code# country# region# currency# term_in_months# lender_count# borrower_genders# repayment_interval | |
|---|---|
| 0 | 0#175.0#175.0#Liquor Store / Off-License#Food#... |
| 1 | 1#325.0#325.0#Livestock#Agriculture#to buy 3 z... |
data_2 = pd.read_csv('input/your_part2.csv',
sep='#',
index_col=0)
data_2.tail(5)
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 336200 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | female | monthly |
| 336201 | 25.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 1 | female | monthly |
| 336202 | 0.0 | 25.0 | Games | Entertainment | NaN | KE | Kenya | NaN | KES | 13.0 | 0 | NaN | monthly |
| 336203 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
| 336204 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
Aim: Prove wheather the concat() function is the right method to join the two data set
data_1.columns == data_2.columns
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True])
data_1.index
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7,
8, 9,
...
334990, 334991, 334992, 334993, 334994, 334995, 334996, 334997,
334998, 334999],
dtype='int64', length=335000)
data_2.index
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7,
8, 9,
...
336195, 336196, 336197, 336198, 336199, 336200, 336201, 336202,
336203, 336204],
dtype='int64', length=336205)
data_concat = pd.concat([data_1, data_2])
data_concat
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | female | irregular |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | female, female | irregular |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | female | bullet |
| 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | female | irregular |
| 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | female | monthly |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336200 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | female | monthly |
| 336201 | 25.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 1 | female | monthly |
| 336202 | 0.0 | 25.0 | Games | Entertainment | NaN | KE | Kenya | NaN | KES | 13.0 | 0 | NaN | monthly |
| 336203 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
| 336204 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
671205 rows × 13 columns
# check for correctness
len(data_1)+len(data_2) == len(data_concat)
True
# check for correctness
data_concat.loc[0,:]
| funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | female | irregular |
| 0 | 175.0 | 175.0 | Liquor Store / Off-License | Food | to purchase additional stock of coconut wine t... | PH | Philippines | Palo, Leyte | PHP | 8.0 | 6 | female | irregular |
data_concat.reset_index(inplace=True) # reset the indexes
data_concat
| index | funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | female | irregular |
| 1 | 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | female, female | irregular |
| 2 | 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | female | bullet |
| 3 | 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | female | irregular |
| 4 | 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | female | monthly |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 671200 | 336200 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | female | monthly |
| 671201 | 336201 | 25.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 1 | female | monthly |
| 671202 | 336202 | 0.0 | 25.0 | Games | Entertainment | NaN | KE | Kenya | NaN | KES | 13.0 | 0 | NaN | monthly |
| 671203 | 336203 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
| 671204 | 336204 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
671205 rows × 14 columns
data = data_concat.drop(columns=['index']) # remove the initial index column
data.shape
(671205, 13)
data.columns
Index([' funded_amount', ' loan_amount', ' activity', ' sector', ' use',
' country_code', ' country', ' region', ' currency', ' term_in_months',
' lender_count', ' borrower_genders', ' repayment_interval'],
dtype='object')
# display the memory usage
print(f'The data set has a memory of {data.memory_usage(deep=True).sum()*1e-6} mb.')
The data set has a memory of 450.775664 mb.
# remove the free space
data.columns = data.columns.str.replace(' ', '')
# check the results
data.columns
Index(['funded_amount', 'loan_amount', 'activity', 'sector', 'use',
'country_code', 'country', 'region', 'currency', 'term_in_months',
'lender_count', 'borrower_genders', 'repayment_interval'],
dtype='object')
data['country'].unique()
array(['Pakistan', 'India', 'Kenya', 'Nicaragua', 'El Salvador',
'Tanzania', 'Philippines', 'Peru', 'Senegal', 'Cambodia',
'Liberia', 'Vietnam', 'Iraq', 'Honduras', 'Palestine', 'Mongolia',
'United States', 'Mali', 'Colombia', 'Tajikistan', 'Guatemala',
'Ecuador', 'Bolivia', 'Yemen', 'Ghana', 'Sierra Leone', 'Haiti',
'Chile', 'Jordan', 'Uganda', 'Burundi', 'Burkina Faso',
'Timor-Leste', 'Indonesia', 'Georgia', 'Ukraine', 'Kosovo',
'Albania', 'The Democratic Republic of the Congo', 'Costa Rica',
'Somalia', 'Zimbabwe', 'Cameroon', 'Turkey', 'Azerbaijan',
'Dominican Republic', 'Brazil', 'Mexico', 'Kyrgyzstan', 'Armenia',
'Paraguay', 'Lebanon', 'Samoa', 'Israel', 'Rwanda', 'Zambia',
'Nepal', 'Congo', 'Mozambique', 'South Africa', 'Togo', 'Benin',
'Belize', 'Suriname', 'Thailand', 'Nigeria', 'Mauritania',
'Vanuatu', 'Panama', 'Virgin Islands',
'Saint Vincent and the Grenadines',
"Lao People's Democratic Republic", 'Malawi', 'Myanmar (Burma)',
'Moldova', 'South Sudan', 'Solomon Islands', 'China', 'Egypt',
'Guam', 'Afghanistan', 'Madagascar', 'Namibia', 'Puerto Rico',
'Lesotho', "Cote D'Ivoire", 'Bhutan'], dtype=object)
data.loc[data['country'] == "Lao People's Democratic Republic", 'country'] = 'Lao People\'s Democratic Republic'
data.dtypes
funded_amount float64 loan_amount float64 activity object sector object use object country_code object country object region object currency object term_in_months float64 lender_count int64 borrower_genders object repayment_interval object dtype: object
# rename the headers
data.columns = ['Funded', 'Requested', 'Activity', 'Sector', 'Use', 'Country code',
'Country', 'Region', 'Currency', 'Term in months', 'Lenders',
'Borrower genders', 'Repayment interval']
Since there are no entries that could be considered as unique values and that could be of help to identify duplicted data points, therefore, as duplicates are going to be interpreted those rows whose contents fully mirror one another.
print(f'{data.duplicated().sum()} data points may be disposed as if they are duplicates.')
24372 data points may be disposed of if they are duplicates.
To Do:
Duplicates complicate the analysis of platform activities as they increase data redundancy and distort the data. As well, the gained insights may not be trustworthy and may, for example, lead to financial decisions that are not traceable. So, the duplicates are going to be cleaned off.
Check the correctness of the action
data.drop_duplicates(inplace=True)
data.reset_index(drop=True)
| Funded | Requested | Activity | Sector | Use | Country code | Country | Region | Currency | Term in months | Lenders | Borrower genders | Repayment interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | female | irregular |
| 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | female, female | irregular |
| 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | female | bullet |
| 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | female | irregular |
| 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | female | monthly |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 646828 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | female | monthly |
| 646829 | 0.0 | 25.0 | Livestock | Agriculture | Reviewed loan use in english. | PK | Pakistan | Attock | PKR | 13.0 | 0 | female | monthly |
| 646830 | 0.0 | 125.0 | Livestock | Agriculture | Pretend the flagged issue was addressed by KC. | MX | Mexico | Iztacalco | MXN | 13.0 | 0 | female, female | monthly |
| 646831 | 0.0 | 875.0 | Livestock | Agriculture | Translated loan use to english. | BO | Bolivia | La Paz | BOB | 13.0 | 0 | female, female | monthly |
| 646832 | 0.0 | 250.0 | Livestock | Agriculture | Reviewed loan use in english. | GH | Ghana | Dansoman | GHS | 13.0 | 0 | female | monthly |
646833 rows × 13 columns
# check the result
print(f'The data set contains {data.duplicated().sum()} duplicates.')
The data set contains 0 duplicates.
df = data.copy() # save a copy of the data set
data.shape == df.shape # check for the shape to assure the correctness of the saved copy
True
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 646833 entries, 0 to 671188 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Funded 646833 non-null float64 1 Requested 646833 non-null float64 2 Activity 646833 non-null object 3 Sector 646833 non-null object 4 Use 642934 non-null object 5 Country code 646825 non-null object 6 Country 646833 non-null object 7 Region 590670 non-null object 8 Currency 646833 non-null object 9 Term in months 646833 non-null float64 10 Lenders 646833 non-null int64 11 Borrower genders 642945 non-null object 12 Repayment interval 646833 non-null object dtypes: float64(3), int64(1), object(9) memory usage: 69.1+ MB
Interpretation:
The presence of NaNs in the following list does not exclude the columns from beeing investigated for synonyms.
The data points of the following columns is less than the total data points, so the following 4 columns include NaNs:
The following remaining columns of object data type are still questioned on the consistency of missing values:
All numeric columns are of float or integer type, thus these do not contain synonyms.
df.nunique()
Funded 610 Requested 479 Activity 163 Sector 15 Use 424912 Country code 86 Country 87 Region 12695 Currency 67 Term in months 148 Lenders 503 Borrower genders 11298 Repayment interval 4 dtype: int64
df.describe(include='all')
| Funded | Requested | Activity | Sector | Use | Country code | Country | Region | Currency | Term in months | Lenders | Borrower genders | Repayment interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 646833.000000 | 646833.000000 | 646833 | 646833 | 642934 | 646825 | 646833 | 590670 | 646833 | 646833.000000 | 646833.000000 | 642945 | 646833 |
| unique | NaN | NaN | 163 | 15 | 424912 | 86 | 87 | 12695 | 67 | NaN | NaN | 11298 | 4 |
| top | NaN | NaN | Farming | Agriculture | to buy a water filter to provide safe drinking... | PH | Philippines | Lahore | PHP | NaN | NaN | female | monthly |
| freq | NaN | NaN | 69497 | 174624 | 2074 | 154323 | 154323 | 7168 | 154322 | NaN | NaN | 415237 | 330448 |
| mean | 807.845170 | 866.239238 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 13.718920 | 21.110709 | NaN | NaN |
| std | 1145.283451 | 1214.135953 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.479172 | 28.837152 | NaN | NaN |
| min | 0.000000 | 25.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 | 0.000000 | NaN | NaN |
| 25% | 275.000000 | 275.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.000000 | 7.000000 | NaN | NaN |
| 50% | 475.000000 | 500.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 13.000000 | 13.000000 | NaN | NaN |
| 75% | 925.000000 | 1000.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 14.000000 | 25.000000 | NaN | NaN |
| max | 100000.000000 | 100000.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 158.000000 | 2986.000000 | NaN | NaN |
nan_sum = df.isna().sum() # List of columns with NaNs and contained number of NaNs
nan_sum[nan_sum > 0].sort_values(ascending=False)
Region 56163 Use 3899 Borrower genders 3888 Country code 8 dtype: int64
nan_stat = df.isna().sum() # Distribution of NaNs in each column
np.round(nan_stat[nan_stat > 0]/df.shape[0],5).sort_values(ascending=False)
Region 0.08683 Use 0.00603 Borrower genders 0.00601 Country code 0.00001 dtype: float64
filter_data = data[data.isin(['missing_type', '?', 'unknown', 'missing type', 'dummy', 'missing_values', 'missing values', np.inf, -np.inf]).any(axis='columns')]
filter_data
| Funded | Requested | Activity | Sector | Use | Country code | Country | Region | Currency | Term in months | Lenders | Borrower genders | Repayment interval |
|---|
# extract the data points containing the NaNs
data.loc[data['Country code'].isnull()]
| Funded | Requested | Activity | Sector | Use | Country code | Country | Region | Currency | Term in months | Lenders | Borrower genders | Repayment interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 202537 | 4150.0 | 4150.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | EEnhana | NAD | 6.0 | 162 | female | bullet |
| 202823 | 4150.0 | 4150.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | Rundu | NAD | 6.0 | 159 | male | bullet |
| 344929 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | EEnhana | NAD | 7.0 | 120 | female | bullet |
| 351177 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | Rundu | NAD | 7.0 | 126 | male | bullet |
| 420953 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | EEnhana | NAD | 7.0 | 118 | female | bullet |
| 421218 | 4000.0 | 4000.0 | Wholesale | Wholesale | purchase solar lighting products for sale to l... | NaN | Namibia | Rundu | NAD | 7.0 | 150 | male | bullet |
| 487207 | 5100.0 | 5100.0 | Renewable Energy Products | Retail | to pay for stock of solar lights and cell phon... | NaN | Namibia | Katima Mulilo | NAD | 7.0 | 183 | male | bullet |
| 487653 | 5000.0 | 5000.0 | Wholesale | Wholesale | to maintain a stock of solar lights and cell p... | NaN | Namibia | Oshakati | NAD | 7.0 | 183 | female | bullet |
# check wheather ISO abbreviations have not been recorded only for Namibia
(data['Country'] == 'Namibia').sum() == data['Country code'].isnull().sum()
True
# Replace the NaNS with the corresponding abbreviation
data.loc[data['Country code'].isnull(), 'Country code'] = 'NA'
# 1. check
data.loc[data['Country code'].isnull()]
| Funded | Requested | Activity | Sector | Use | Country code | Country | Region | Currency | Term in months | Lenders | Borrower genders | Repayment interval |
|---|
# 2. check
data.loc[data['Country'] == 'Namibia']
| Funded | Requested | Activity | Sector | Use | Country code | Country | Region | Currency | Term in months | Lenders | Borrower genders | Repayment interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 202537 | 4150.0 | 4150.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NA | Namibia | EEnhana | NAD | 6.0 | 162 | female | bullet |
| 202823 | 4150.0 | 4150.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NA | Namibia | Rundu | NAD | 6.0 | 159 | male | bullet |
| 344929 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NA | Namibia | EEnhana | NAD | 7.0 | 120 | female | bullet |
| 351177 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NA | Namibia | Rundu | NAD | 7.0 | 126 | male | bullet |
| 420953 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NA | Namibia | EEnhana | NAD | 7.0 | 118 | female | bullet |
| 421218 | 4000.0 | 4000.0 | Wholesale | Wholesale | purchase solar lighting products for sale to l... | NA | Namibia | Rundu | NAD | 7.0 | 150 | male | bullet |
| 487207 | 5100.0 | 5100.0 | Renewable Energy Products | Retail | to pay for stock of solar lights and cell phon... | NA | Namibia | Katima Mulilo | NAD | 7.0 | 183 | male | bullet |
| 487653 | 5000.0 | 5000.0 | Wholesale | Wholesale | to maintain a stock of solar lights and cell p... | NA | Namibia | Oshakati | NAD | 7.0 | 183 | female | bullet |
data['Region'].value_counts().head(50)
Lahore 7168 Kaduna 5573 Rawalpindi 4344 Cusco 3838 Dar es Salaam 3719 Kisii 3468 Narra, Palawan 3155 Medellín 2999 Palo, Leyte 2940 Quezon, Palawan 2929 San Miguel 2850 Brookes Point, Palawan 2753 Tacloban City, Leyte 2736 San Gabriel 2734 Kabankalan, Negros Occidental 2728 Kitale 2689 Thanh Hoá 2647 Eldoret 2597 Cordova, Cebu 2594 Webuye 2513 Multan 2473 Bais, Negros Oriental 2444 Roxas City, Capiz 2436 Managua 2403 Solola 2383 Banga, Aklan 2315 Kisumu 2308 Gotera 2300 Nakuru 2287 Hinigaran, Negros Occidental 2282 Usulután 2230 Likoni 2219 Isabela, Negros Occidental 2209 Dumaguete, Negros Oriental 2202 Ciudad El Triunfo 2176 San Lorenzo 2163 Kampala 2111 Calbayog City, Samar 2052 La Paz 2042 Tanjay, Negros Oriental 2020 Tulcán 1996 Kasese 1985 Calatrava, Negros Occidental 1943 Fort Portal 1941 Portoviejo 1924 Himamaylan, Negros Occidental 1922 Minglanilla, Cebu 1907 Barranquilla 1906 Tangerang 1905 El Transito 1842 Name: Region, dtype: int64
Interpretation:
This column contains mainly regions, but also municipalities or other smaller administrative units are represented.
The names are given mainly in English, but also in the local language.
There are also abbreviated words, numbers or information about the type of administrative units.
In some cases there are no spaces between punctuation marks and values.
Since the entries in the region column do not follow a strict convention and it is difficult to divide the values into different types of departments, it is not possible to substitute them.
# look for
data.loc[:,'Region'].value_counts(dropna=False)
NaN 56163
Lahore 7168
Kaduna 5573
Rawalpindi 4344
Cusco 3838
...
Maplewood 1
Ifakara 1
Sioma 1
Say village, Batken region 1
alejandria 1
Name: Region, Length: 12696, dtype: int64
# see some samples containing missing values and analyse them in relation to other columns
data.loc[data.loc[:,'Use'].isnull()]
| Funded | Requested | Activity | Sector | Use | Country code | Country | Region | Currency | Term in months | Lenders | Borrower genders | Repayment interval | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 140 | 2975.0 | 2975.0 | Food Production/Sales | Food | NaN | TZ | Tanzania | NaN | TZS | 10.0 | 110 | NaN | monthly |
| 145 | 1200.0 | 1200.0 | Personal Expenses | Personal Use | NaN | PE | Peru | NaN | PEN | 20.0 | 44 | NaN | monthly |
| 170 | 4250.0 | 4250.0 | Catering | Food | NaN | TZ | Tanzania | NaN | TZS | 10.0 | 116 | NaN | monthly |
| 412 | 2350.0 | 2350.0 | Beauty Salon | Services | NaN | TZ | Tanzania | NaN | TZS | 10.0 | 75 | NaN | monthly |
| 414 | 725.0 | 725.0 | Agriculture | Agriculture | NaN | SV | El Salvador | NaN | USD | 20.0 | 19 | NaN | monthly |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 659604 | 5625.0 | 10000.0 | Weaving | Arts | NaN | BT | Bhutan | NaN | USD | 14.0 | 210 | NaN | irregular |
| 660788 | 1975.0 | 1975.0 | Home Energy | Personal Use | NaN | PS | Palestine | NaN | USD | 27.0 | 39 | NaN | monthly |
| 661718 | 800.0 | 1600.0 | Furniture Making | Manufacturing | NaN | HT | Haiti | NaN | HTG | 13.0 | 27 | NaN | irregular |
| 671151 | 0.0 | 25.0 | Livestock | Agriculture | NaN | KE | Kenya | NaN | KES | 13.0 | 0 | NaN | monthly |
| 671174 | 0.0 | 25.0 | Games | Entertainment | NaN | KE | Kenya | NaN | KES | 13.0 | 0 | NaN | monthly |
3899 rows × 13 columns
Interpretation:
Based on the mentions of this column in the introductory chapter, it could be argued that the values do not follow any recording rule, i.e., the documentation is rather deliberate and reminiscent of bags of comments. Moreover, there are 424912 unique valeus in this column. Therefore, it is difficult to replace the missing values with specific values (most common words) or alternatives.
Analyzing the Use column in agreement with the other columns, region and borrowe, genders, these also contain NaNs. In the 'Region' subsection, it was noted that there are countries that occur only once. By removing the rows with NaNs, important categories/values could be disregarded, making the findings incomplete.
These columns could help in the later exploratory analyses to gain insights or better understand the graphs, e.g., some word clouds could be created for a selection of interested countries or sectors.
Based on this analysis, it would be better to leave the missing values in this column and the column itself in its original state.
data.dtypes
Funded float64 Requested float64 Activity object Sector object Use object Country code object Country object Region object Currency object Term in months float64 Lenders int64 Borrower genders object Repayment interval object dtype: object
loan_describe = data['Requested'].describe()
loan_describe
count 646833.000000 mean 866.239238 std 1214.135953 min 25.000000 25% 275.000000 50% 500.000000 75% 1000.000000 max 100000.000000 Name: Requested, dtype: float64
fig = px.scatter(data_frame=data, x= 'Country', y='Requested', title='Overview over requested credits')
fig.update_layout(yaxis=dict(tickmode = 'linear', tick0 = 0.00, dtick = 25000.00))
fig.show()